Automated Blog Classification: Challenges and Pitfalls

نویسندگان

  • Hong Qu
  • Andrea La Pietra
  • Sarah S. Poon
چکیده

Blogs are difficult to categorize by humans and machines alike, because they are written in a capricious style. In the early days of web, directories maintain by humans could not keep up millions the websites; likewise, blog directories cannot keep up with the explosive growth of the blogsphere. This paper investigates the efficacy of using machine learning to categorize blogs. We design a text classification experiment to categorize one hundred and twenty blogs into four topics: personal diary, news, political, and sports. The baseline feature is unigrams weighed by TF-IDF, which yielded 84% accuracy. We analyze the corpus, features, and result data. Our analysis leads us to believe that blog taxonomies need to support polyhierarchy—a given blog may be correctly classified under more than one category.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Social Media for Service Innovations: Challenges and Pitfalls

This article investigates how social software such as blogs can be used to collect ideas generated by the users in the service innovation process. After a theoretical discussion of user involvement and more specifically user involvement using social software and interactive web-tools, the article reports the results from a field experiment at a university library. In the experiment, a blog was ...

متن کامل

IMPACTS AND CHALLENGES OF CLOUD COMPUTING FOR SMALL AND MEDIUM SCALE BUSINESSES IN NIGERIA

Cloud computing technology is providing businesses, be it micro, small, medium, and large scale enterprises with the same level playing grounds. Small and Medium enterprises (SMEs) that have adopted the cloud are taking their businesses to greater heights with the competitive edge that cloud computing offers. The limitations faced by (SMEs) in procuring and maintaining IT infrastructures has be...

متن کامل

Automated classification of pulmonary nodules through a retrospective analysis of conventional CT and two-phase PET images in patients undergoing biopsy

Objective(s): Positron emission tomography/computed tomography (PET/CT) examination is commonly used for the evaluation of pulmonary nodules since it provides both anatomical and functional information. However, given the dependence of this evaluation on physician’s subjective judgment, the results could be variable. The purpose of this study was to develop an automated scheme for the classific...

متن کامل

A Methodology to Cluster low-quality Data in the Web

Analyzing and classifying web content is a task that has been attracting an increasing amount of interest in recent years. However there are additional challenges to face with user generated content emanating from Web 2.0 applications such as blogs, commentaries, reviews etc. The typical characteristics of this information include features such as shortness, overlapping vocabulary, and vocabula...

متن کامل

Mining Commonsense Knowledge From Personal Stories in Internet Weblogs

Recent advances in automated knowledge base construction have created new opportunities to address one of the hardest challenges in Artificial Intelligence: automated commonsense reasoning. In this paper, we describe our recent efforts in mining commonsense knowledge from the personal stories that people write about their lives in their Internet weblogs. We summarize three preliminary investiga...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006